Hugging Face: From Zero to Fine-tuning

Welcome to this comprehensive yet concise tutorial on Hugging Face! By the end of this notebook, you'll have a solid understanding of the core concepts and be able to use the Hugging Face ecosystem for your own NLP projects. We'll cover the essentials, from the pipeline function for quick and easy inference to loading models and tokenizers for more custom tasks.

1. Setting Up Your Environment

# Uncomment the following line to install necessary libraries
# !pip install transformers datasets torch torchvision torchaudio python-dotenv numpy==1.26.4 accelerate -q
import os
from dotenv import load_dotenv
from huggingface_hub import login

# Load environment variables from .env file
load_dotenv()
hf_token = os.getenv('HUGGINGFACE_ACCESS_TOKEN')

if not hf_token:
    raise ValueError('HUGGINGFACE_ACCESS_TOKEN not found in .env file')

login(token=hf_token)
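
The cell above expects a .env file next to the notebook containing your access token (created in your account settings on huggingface.co). A minimal .env holds a single line like the following, with the placeholder replaced by your own token:

# .env  (keep this file out of version control)
HUGGINGFACE_ACCESS_TOKEN=hf_your_token_here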

2. The pipeline Function: Your Gateway to NLP

The pipeline function is the easiest way to get started with Hugging Face. It abstracts away most of the complexity and allows you to perform a wide range of tasks with just a few lines of code. Let's explore some of the most common tasks:

2.1. Sentiment Analysis

from transformers import pipeline

sentiment_analyzer = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = sentiment_analyzer("Hugging Face is awesome!")
print(result)
Device set to use mps:0
[{'label': 'POSITIVE', 'score': 0.9998737573623657}]
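
Pipelines also accept a list of inputs, which is handy for scoring several texts in one call. A minimal sketch reusing the sentiment_analyzer defined above (the example sentences are just illustrative):

# The pipeline returns one result dict per input text
texts = [
    "I love this library.",
    "The documentation could be better.",
]
results = sentiment_analyzer(texts)
for text, res in zip(texts, results):
    print(f"{text} -> {res['label']} ({res['score']:.3f})")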

2.2. Text Generation

from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')  # Suppress warnings
import logging
logging.getLogger('transformers').setLevel(logging.ERROR)  # Suppress transformers logs

text_generator = pipeline("text-generation")
result = text_generator("In a world where AI is becoming more prevalent,", max_length=50)

# Only print the generated text
if isinstance(result, list) and len(result) > 0 and 'generated_text' in result[0]:
    print(result[0]['generated_text'])
else:
    print(result)
In a world where AI is becoming more prevalent, AI is also evolving into a whole new field, and we are seeing a lot of AI evolving into a whole new field.

What's your takeaway from this?

I think that as AI grows, we will see a lot more of it. The more we learn about the problem of AI, the more we will realize how important it is to learn about AI. That's a good thing because there may be more of it.

If you can't get it to become a part of your life, then you're not really there.

Do you think it's possible that AI will be able to do more to drive the next AI breakthrough?

If the AI breakthroughs are to happen, I'm not sure that we will.

"The future is always being explored and developed"

Yes, yes it is.

How do you think the AI revolution has changed the way we think about AI?

I think that's a good question. I think that the AI revolution has changed the way we think about AI.

The future is always being explored and developed.

If you look at the way human beings have been doing this for a long time, and we're slowly making that transition, then
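
Because no model name was given, the pipeline falls back to a default checkpoint (gpt2 at the time of writing), and sampling makes the output vary between runs. Here is a sketch of a more explicit call that pins the model, fixes the random seed, and controls the length of the continuation with max_new_tokens (the parameter values are just illustrative):

from transformers import pipeline, set_seed

set_seed(42)  # make the sampled continuation reproducible
text_generator = pipeline("text-generation", model="gpt2")
result = text_generator(
    "In a world where AI is becoming more prevalent,",
    max_new_tokens=40,   # length of the continuation, not counting the prompt
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.8,     # lower values make the text more conservative
)
print(result[0]["generated_text"])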

2.3. Named Entity Recognition (NER)

ner_pipeline = pipeline("ner", grouped_entities=True)
result = ner_pipeline("My name is John Doe and I live in New York City.")
print(result)
[{'entity_group': 'PER', 'score': 0.9970439, 'word': 'John Doe', 'start': 11, 'end': 19}, {'entity_group': 'LOC', 'score': 0.9993253, 'word': 'New York City', 'start': 34, 'end': 47}]
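
Since no model name is given here either, the pipeline downloads a default NER checkpoint on first use. Note also that grouped_entities=True still works but has been superseded by the aggregation_strategy argument in recent transformers releases. A sketch of the more explicit form, naming dbmdz/bert-large-cased-finetuned-conll03-english (the pipeline's default NER model at the time of writing):

from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",  # merge sub-word tokens into whole entities
)
print(ner_pipeline("My name is John Doe and I live in New York City."))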

3. Under the Hood: Models and Tokenizers

While the pipeline function is great for quick tasks, you'll often need more control over the model and tokenizer. Let's see how to load them manually.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Prepare the input
text = "Hugging Face is a great tool for NLP."
inputs = tokenizer(text, return_tensors="pt")

# Make the prediction
outputs = model(**inputs)
logits = outputs.logits
predicted_class = logits.argmax().item()

print(f"Predicted class: {predicted_class}")
Predicted class: 1
# Example with negative sentiment (should predict class 0)
text_neg = "This is a terrible experience. I am very disappointed."
inputs_neg = tokenizer(text_neg, return_tensors="pt")
outputs_neg = model(**inputs_neg)
logits_neg = outputs_neg.logits
predicted_class_neg = logits_neg.argmax().item()
print(f"Predicted class: {predicted_class_neg}")
Predicted class: 0

The output Predicted class: 1 means that the model classified your input text ("Hugging Face is a great tool for NLP.") as belonging to class 1. For distilbert-base-uncased-finetuned-sst-2-english, this mapping is defined in the model's configuration: class 1 corresponds to "POSITIVE" sentiment and class 0 to "NEGATIVE", which matches the result for the negative example above.
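
You don't have to rely on convention, though: the index-to-label mapping ships with the model's configuration, and the logits can be turned into probabilities with a softmax. A small sketch continuing from the cell above:

import torch

# The config stores the mapping, {0: 'NEGATIVE', 1: 'POSITIVE'} for this checkpoint
print(model.config.id2label)

# Convert logits to probabilities and look up the human-readable label
probs = torch.softmax(logits, dim=-1)
predicted_label = model.config.id2label[predicted_class]
print(f"{predicted_label} ({probs[0, predicted_class].item():.4f})")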

4. Fine-Tuning a Model

Fine-tuning lets you adapt a pre-trained model to your specific dataset. For a hands-on example, we'll use the datasets library to load a small slice of a dataset and then fine-tune a model on it with the Trainer API.

import warnings
warnings.filterwarnings('ignore')  # Suppress all warnings
import logging
logging.getLogger('transformers').setLevel(logging.ERROR)  # Suppress transformers logs
logging.getLogger('datasets').setLevel(logging.ERROR)  # Suppress datasets logs
logging.getLogger('filelock').setLevel(logging.ERROR)  # Suppress filelock logs

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

# Load a dataset
dataset = load_dataset("imdb", split="train[:1%]")

# Load a tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Preprocess the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Set up the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Create the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# Fine-tune the model
trainer.train()
{'train_runtime': 44.6356, 'train_samples_per_second': 5.601, 'train_steps_per_second': 0.358, 'train_loss': 0.6679916381835938, 'epoch': 1.0}
TrainOutput(global_step=16, training_loss=0.6679916381835938, metrics={'train_runtime': 44.6356, 'train_samples_per_second': 5.601, 'train_steps_per_second': 0.358, 'train_loss': 0.6679916381835938, 'epoch': 1.0})
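
Two practical notes on this example. First, the imdb train split is stored grouped by label, so a plain train[:1%] slice can end up containing reviews of mostly (or only) one class; shuffling before selecting a small subset gives a more representative sample. Second, once training finishes you will usually want to save the fine-tuned model and tokenizer so they can be reloaded later, for example through a pipeline. A sketch of both ideas (the subset size and output path are just illustrative):

from transformers import pipeline

# A more representative subset: shuffle before taking a small sample
dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(250))

# Persist the fine-tuned weights and the tokenizer together
trainer.save_model("./my-finetuned-distilbert")
tokenizer.save_pretrained("./my-finetuned-distilbert")

# Reload for inference; labels show as LABEL_0/LABEL_1 unless id2label is set on the config
classifier = pipeline("sentiment-analysis", model="./my-finetuned-distilbert")
print(classifier("A surprisingly touching film with great performances."))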

5. Conclusion

Congratulations! 🎉 You've now learned the fundamentals of Hugging Face. You can use the pipeline for quick inference, load models and tokenizers for more control, and even fine-tune pre-trained models on your own data. This is just the beginning of your journey with Hugging Face. Explore the Hugging Face Hub website to discover thousands of models and datasets, and dive deeper into the documentation to unlock the full power of the ecosystem.

Happy coding!